Abstract: Adaptive dynamical systems arise in a multitude 
of contexts, e.g., optimization, control, communications, signal 
processing, and machine learning. A precise characterization 
of their fundamental limitations is therefore of paramount 
importance. In this paper, we consider the general problem of 
adaptively controlling and/or identifying a stochastic dynamical 
system, where our a priori knowledge allows us to place the 
system in a subset of a metric space (the uncertainty set). We 
present an information-theoretic meta-theorem that captures 
the trade-off between the metric complexity (or richness) of 
the uncertainty set, the amount of information acquired online 
in the process of controlling and observing the system, and 
the residual uncertainty remaining after the observations have 
been collected. Following the approach of Zames, we quantify 
a priori information by the Kolmogorov (metric) entropy of 
the uncertainty set, while the information acquired online is 
expressed as a sum of information divergences. The general 
theory is used to derive new minimax lower bounds on the 
metric identification error, as well as to give a simple derivation 
of the minimum time needed to stabilize an uncertain stochastic 
linear system. 

I. Introduction 

What is adaptation? What is learning? These two questions 
arise all the time in practically any discussion of complex 
systems exhibiting complex behaviors. In control theory, 
these notions were a consistent theme in the work of George 
Zames (see, e.g., [1] and references therein), who has put 
forward the following theses: 

1) Adaptation and learning involve acquisition of infor- 
mation about the object (system) being controlled. 

2) The appropriate notions of information are metric, 
locating the system in, say, a ball in a metric space. 

3) Acquiring information takes time. 

4) Nonadaptive (or robust) control optimizes performance 
on the basis of a priori information, whereas adaptive 
control is based on a posteriori information acquired 
online. 

In this paper, we take up the problem of characterizing 
the fundamental limitations of adaptive stochastic dynamical 
systems following the programme of Zames. We start by 
presenting a "Meta-Theorem" that ties together the three 
kinds of information mentioned by Zames: a priori infor- 
mation, represented by the metric complexity of the class 
of systems of interest; information acquired online as the 
system is being controlled; and a posteriori information, 
pertaining to the difficulty of identifying the system after 
a given length of time. Roughly speaking, given an arbitrary 
class of systems, an arbitrary controller, and an arbitrary 
identification algorithm, the Meta-Theorem quantifies the 
interplay and the trade-off between the initial uncertainty 
about the system, the online performance of the controller, 
and the final uncertainty remaining after the control task has 
been carried out. 

We follow Zames in two key respects: 

1) We adopt the Kolmogorov entropy [2] as our measure 
of a priori uncertainty (or complexity) of the class of 
systems at hand. 

2) We compare this initial uncertainty against the uncer- 
tainty remaining after the control signals have been 
applied. 

However, the novel aspect of our approach is the way 
in which we quantify the process of online information 
acquisition — namely, through Shannon's information theory 
[3]. Conceptually, our methodology is close to the way 
information-theoretic tools are being used in mathematical 
statistics to derive minimax bounds on the risk of statisti- 
cal estimation procedures (see, e.g., [4]-[6] and references 
therein). The difference between statistical estimation and 
adaptive control, however, lies in the fact that, in control, 
we actively intervene into the system in order to steer 
it towards some desired state (control proper) or to learn 
something about the system (system identification). When 
we do not possess complete knowledge of the system, these 
two objectives may be in conflict, giving rise to the so-called 
dual effect of control [7]. With the exception of experimental 
design [8], [9] (and, in particular, some work connecting 
it with control [10], [11]), statistical estimation involves 
passively observing sample paths of a random process for 
the purpose of inference. Our Meta-Theorem covers both 
estimation and control, since the former can be viewed as 
an application of a control strategy that has no effect on the 
system, and it provides a way of quantifying the dual effect 
in the latter. 

Following the statement and the proof of the Meta-Theorem 
in Section IV, we show how it can be used to 
derive (a) fundamental limits on the performance of system 
identification from input-output data, and (b) a lower bound 
on the minimum time needed to adaptively stabilize an 
uncertain linear system. 

For system identification, we derive a minimax lower 
bound on the metric identification error, which shows that 
the intrinsic difficulty of identifying a system is determined 
by the balance of a priori metric information and the rate at 
which a posteriori information accumulates over time. We 
also show that ease of identification implies small a priori 
uncertainty. These results apply to any controller and any 
identification algorithm, providing yet another quantitative 
illustration of the dual effect. Bounds of similar flavor were 
derived by Yang [6] in the context of statistical estimation 
from i.i.d. samples, and our techniques combine those of 
Yang with a more careful accounting of the accumulation of 
information during control/identification. 

As for adaptive control, the first lower bounds on the rate 
of convergence in adaptive control are due to Nemirovski 
and Tsypkin [12] (see also [13] for further references), and 
we consider the same set-up. However, the proof in [12] is 
rather lengthy and relies on the Cramér-Rao inequality. By 
contrast, we use the Meta-Theorem, which results in a much 
simpler and more direct information-theoretic argument. 

II. The ingredients: systems, controllers, identification algorithms 

A stochastic dynamical system is specified by a sequence 
of stochastic kernels relating present and past inputs and 
outputs to future outputs. The system is initially unknown, 
apart from the fact that we can place it in some uncertainty 
set, which is a subset of a metric space. The system is 
interconnected with a controller, which generates the inputs 
given past inputs and outputs. The exact purpose of control 
can be completely arbitrary, but we stipulate that the con- 
troller has been designed only with the knowledge of the 
uncertainty set. Finally, we consider the possibility that the 
observed temporal evolution of the system (i.e., its input- 
output trajectory) may be fed into an identification algorithm 
with the purpose of locating the system in a "small" region 
of the uncertainty set. 

Specifically, we consider discrete-time stochastic dynamical systems with 
input space $\mathsf{U}$ and output space $\mathsf{Y}$ (all spaces are assumed to be 
standard Borel [14]). The dynamics are assumed to be causal and 
nonanticipative, and so can be represented as a sequence of stochastic kernels 
$\{P_\theta(dy_t \mid y^{t-1}, u^{t-1})\}_{t=1}^{\infty}$, where $\theta$ is a parameter that takes 
values in some metric space $(\Theta, \rho)$ and, for each $t$, 

\[ \Pr\big(Y_t \in B \mid Y^{t-1} = y^{t-1}, U^{t-1} = u^{t-1}\big) = \int_B P_\theta(dy_t \mid y^{t-1}, u^{t-1}) \tag{1} \]

for every Borel set $B \subseteq \mathsf{Y}$. The inputs are generated by 
a controller, which is itself a dynamical system described 
by a sequence of stochastic kernels $\{Q_\gamma(du_t \mid y^t, u^{t-1})\}_{t=1}^{\infty}$, 
where $\gamma$ is a parameter that takes values in some space $\Gamma$ that 
indexes the admissible controllers (e.g., open-loop, affine, 
Lipschitz, Markov, stationary, etc.). The system $\theta$ and the 
controller $\gamma$ are interconnected to form the joint probability 
law $\Pi_{\theta,\gamma}$ of $\{(Y_t, U_t)\}_{t=1}^{\infty}$ on $(\mathsf{Y} \times \mathsf{U})^{\infty}$, so that for each 
$T \in \mathbb{N}$ we have 

\[ \Pi_{\theta,\gamma}(dy^T, du^T) = \bigotimes_{t=1}^{T} Q_\gamma(du_t \mid y^t, u^{t-1}) \otimes P_\theta(dy_t \mid y^{t-1}, u^{t-1}). \tag{2} \]

Finally, we consider identification algorithms that observe 
the system trajectory $(Y_1, U_1), (Y_2, U_2), \ldots$ and attempt to 
estimate the true system model $\theta$. We will consider deterministic 
identification algorithms, so for each $T$ we define 
a $T$-step identification algorithm as a measurable mapping 
$\hat\theta_T : \mathsf{Y}^T \times \mathsf{U}^T \to \Theta$. 

III. Prelude: identification error and metric complexity 

As stated earlier, we assume some a priori knowledge 
about the system of interest, namely that it lies in some 
uncertainty set $\Lambda \subseteq \Theta$. Since our primary interest is in 
capturing the interplay between identification and control, 
we need to quantify the extent to which the systems in $\Lambda$ 
can be identified after having been interconnected with a 
given controller $\gamma$ from $t = 1$ to $t = T$: 

Definition 1. Consider a subset $\Lambda \subseteq \Theta$ of system models 
and a controller $\gamma$. Then the $T$-step minimax identification 
error on $\Lambda$ relative to $\gamma$ is given by 

\[ e_T(\Lambda, \gamma) = \inf_{\hat\theta_T} \sup_{\theta \in \Lambda} \mathbb{E}_{\theta,\gamma}\big[ \rho\big(\hat\theta_T(Y^T, U^T), \theta\big) \big], \tag{3} \]

where the infimum is over all $T$-step identification algorithms. 

The fact that the minimax identification error depends not 
only on the uncertainty set $\Lambda$, but also on the choice of the 
controller $\gamma$, is of key importance. The dependence on $\Lambda$ 
expresses the fact that some classes of systems are intrinsically 
more difficult to identify than others; the dependence on $\gamma$ 
captures the potential tension between control and 
identification/learning (the dual effect [7]). When system identification 
is the sole purpose, the controller $\gamma$ is typically open-loop 
[1], [15], and the underlying deterministic sequence of inputs 
is chosen based on some criteria related to the structure 
of the uncertainty set, as well as to other constraints (e.g., 
stability, power, cost, etc.). However, there are also adaptive 
control strategies that adjust the behavior of the controller 
dynamically based on parameters estimated online [16], [17], 
and our definition of $e_T(\Lambda, \gamma)$ covers this possibility. 

The basic idea, which in the context of control originated 
with Zames, is that the difficulty of identification is bound 
up with the richness of the uncertainty set $\Lambda$: the larger the 
uncertainty set, the harder it is to identify the system. We will 
combine this intuition with a probabilistic argument to show 
that, in a certain sense, system identification is no easier than 
hypothesis testing. Arguments of this sort are quite common 
in statistics [4], [5], but, as we shall see, they are equally 
applicable to control. To get things going, we start 
by proving a simple lower bound on $e_T(\Lambda, \gamma)$: 

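Proposition 1. Let $S$ be any finite $\varepsilon$-separated subset of $\Lambda$, 
i.e., for $S = \{\theta_1, \ldots, \theta_N\}$ 

\[ \rho(\theta_i, \theta_j) \ge \varepsilon, \qquad \forall i \ne j. \tag{4} \]

Let $\mathcal{I}_T(S)$ denote the set of all $T$-step identification algorithms 
taking values in $S$, i.e., $\mathcal{I}_T(S) = \{\hat\theta_T : \mathsf{Y}^T \times \mathsf{U}^T \to S\}$. Then 

\[ e_T(\Lambda, \gamma) \ge \frac{\varepsilon}{2} \inf_{\hat\theta_T \in \mathcal{I}_T(S)} \max_{\theta \in S} \Pi_{\theta,\gamma}\big\{ \hat\theta_T(Y^T, U^T) \ne \theta \big\}. \tag{5} \]

Proof. Using the fact that $S \subseteq \Lambda$ and Markov's inequality, 
we can write 

\[ e_T(\Lambda, \gamma) \ge \frac{\varepsilon}{2} \inf_{\hat\theta_T} \max_{\theta \in S} \Pi_{\theta,\gamma}\big\{ \rho(\hat\theta_T, \theta) \ge \varepsilon/2 \big\}. \tag{6} \]

Given an arbitrary $\hat\theta_T$, define 

\[ \tilde\theta_T = \arg\min_{\theta' \in S} \rho(\hat\theta_T, \theta'). \tag{7} \]

Clearly, $\tilde\theta_T \in \mathcal{I}_T(S)$. Suppose $\theta \in S$. If $\rho(\hat\theta_T, \theta) < \varepsilon/2$, 
then necessarily $\rho(\hat\theta_T, \tilde\theta_T) < \varepsilon/2$, because $\tilde\theta_T$ is the element of $S$ 
closest to $\hat\theta_T$. If $\tilde\theta_T \ne \theta$, the triangle inequality gives 

\[ \rho(\hat\theta_T, \tilde\theta_T) \ge \rho(\tilde\theta_T, \theta) - \rho(\hat\theta_T, \theta) > \varepsilon/2, \tag{8} \]

which is a contradiction. Hence, if $\tilde\theta_T \ne \theta$, then $\rho(\hat\theta_T, \theta) \ge \varepsilon/2$. Thus, 

\[ \max_{\theta \in S} \Pi_{\theta,\gamma}\big\{ \rho(\hat\theta_T, \theta) \ge \varepsilon/2 \big\} \ge \max_{\theta \in S} \Pi_{\theta,\gamma}\big\{ \tilde\theta_T \ne \theta \big\} \tag{9} \]
\[ \ge \inf_{\hat\theta_T \in \mathcal{I}_T(S)} \max_{\theta \in S} \Pi_{\theta,\gamma}\big\{ \hat\theta_T \ne \theta \big\}. \tag{10} \]

Combining this with (6), we get (5). □ 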



The above proposition suggests a trade-off between the 
separation $\varepsilon$ and the probability of correct identification. 
Indeed, if we make $\varepsilon$ small, then the size of the maximal 
$\varepsilon$-separated subset will be large, which in turn will tend to 
increase the probability of identification error. This observation 
naturally prompts us to take a look at the growth 
of maximal separated subsets of $\Lambda$ as a function of the 
separation $\varepsilon$, which is captured by Kolmogorov's notion of 
the metric entropy [2]: 

Definition 2. Given a set $\Lambda \subseteq \Theta$, we define its packing 
numbers by 

\[ N_\rho(\varepsilon; \Lambda) = \max\big\{ N \ge 1 : \exists\, \theta_1, \ldots, \theta_N \in \Lambda \text{ s.t. } \rho(\theta_i, \theta_j) \ge \varepsilon,\ \forall i \ne j \big\} \tag{11} \]

and the corresponding Kolmogorov entropy by $H_\rho(\varepsilon; \Lambda) = \log N_\rho(\varepsilon; \Lambda)$. 
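To make the packing numbers concrete, the following minimal sketch builds an $\varepsilon$-separated subset of a finite point set greedily. It is an illustration only: the function names `greedy_packing` and `entropy_lower_bound`, the toy grid, and the choice of the absolute-value metric are assumptions made here, not part of the paper. A greedy subset is maximal with respect to inclusion, so its size is in general only a lower bound on the packing number.

```python
import math

def greedy_packing(points, eps, dist):
    """Return an eps-separated subset of `points` built greedily.

    The result cannot be extended by any other point of `points`, so its
    size is a lower bound on the packing number N(eps; points).
    """
    chosen = []
    for p in points:
        # Keep p only if it is at least eps away from everything kept so far.
        if all(dist(p, q) >= eps for q in chosen):
            chosen.append(p)
    return chosen

def entropy_lower_bound(points, eps, dist):
    """Log of the greedy packing size: a lower bound on H(eps; points)."""
    return math.log(len(greedy_packing(points, eps, dist)))

# Hypothetical toy uncertainty set: a grid on [0, 1] with the usual metric.
grid = [i / 10 for i in range(11)]
packing = greedy_packing(grid, 0.25, lambda a, b: abs(a - b))
```

On this grid the greedy subset happens to attain the packing number, but that is a feature of the toy example rather than a general guarantee.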

IV. The Meta-Theorem 

Now that all the ingredients are in place, we can state and 
prove our Meta-Theorem, which captures the interplay between 
the metric complexity of the uncertainty set $\Lambda$ (a priori 
information, as per Zames), the information acquired online 
by acting on the system and observing its response, and the 
uncertainty remaining after $T$ time steps. The main idea is 
to embed the problem of adaptive control and identification 
in a "doubly stochastic" set-up, in which Nature first selects 
a system at random from an $\varepsilon$-separated subset of $\Lambda$, and 
then this system is interconnected with a given controller 
and fed into a given identification algorithm. The Meta-Theorem 
applies to any uncertainty set, any controller, and 
any identification algorithm. Our usage of the prefix "meta" 
is intended to draw parallels to recent work of Polyanskiy et 
al. [18], [19], which develops a "meta-converse" for channel 
coding by relating the performance of any channel coding 
scheme on one channel to its performance on another (we 
will elaborate on these parallels shortly). 

Given a separation $\varepsilon > 0$, let $\Lambda_\varepsilon = \{\theta_1, \ldots, \theta_N\} \subseteq \Lambda$, 
$N = N_\rho(\varepsilon; \Lambda)$, be any maximal $\varepsilon$-packing set, and suppose 
that the system model is drawn uniformly at random from $\Lambda_\varepsilon$. 
Then this system is interconnected with a given controller $\gamma$. 
To describe all the events pertaining to this interconnection, 
we construct a probability space $(\Omega, \mathcal{B}, \mathbb{P})$ with the following 
random variables defined on it: 

• $W \in [N]$, the random choice of a system model in $\Lambda_\varepsilon$ 

• $U^T \in \mathsf{U}^T$, the inputs applied to the system by $\gamma$ 

• $Y^T \in \mathsf{Y}^T$, the resulting outputs. 

These variables describe the interaction between the system 
and the controller, and thus have the causal ordering 

\[ W, Y_1, U_1, \ldots, Y_t, U_t, \ldots, Y_T, U_T, \tag{12} \]

where, $\mathbb{P}$-almost surely, 

\[ \mathbb{P}(W = i) = \frac{1}{N}, \qquad \forall i \in [N] \tag{13} \]
\[ \mathbb{P}(U_t \in A \mid W, Y^t, U^{t-1}) = Q_\gamma(A \mid Y^t, U^{t-1}) \tag{14} \]
\[ \mathbb{P}(Y_t \in B \mid W, Y^{t-1}, U^{t-1}) = P_{\theta_W}(B \mid Y^{t-1}, U^{t-1}) \tag{15} \]

for all Borel sets $A \subseteq \mathsf{U}$, $B \subseteq \mathsf{Y}$. In other words, 
$W \to (Y^t, U^{t-1}) \to U_t$ is a Markov chain for each $t$. To simplify 
notation, let us denote by $Z_t$ the pair $(Y_t, U_t)$. At time 
$T$ the entire sequence $Z^T = (Z_1, \ldots, Z_T)$ is fed into an 
identification algorithm $\hat\theta_T$. 

With these definitions, we are now in a position to state 
the Meta-Theorem: 

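Theorem 1. Consider any controller $\gamma$ and any $T$-step 
identification algorithm $\hat\theta_T \in \mathcal{I}_T(\Lambda_\varepsilon)$. Then the bound 

\[ H_\rho(\varepsilon; \Lambda) \cdot \min_{\theta \in \Lambda_\varepsilon} \Pi_{\theta,\gamma}\big\{ \hat\theta_T = \theta \big\} \le \sum_{t=1}^{T} D\big( \mathbb{P}_{Y_t|Z^{t-1},W} \,\big\|\, \mathbb{Q}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1},W} \big) + \log 2 \tag{16} \]

holds for any sequence of stochastic kernels $\{\mathbb{Q}_{Y_t|Z^{t-1}}\}_{t=1}^{T}$ 
that satisfy the condition $\mathbb{P}_{Y_t|Z^{t-1}} \ll \mathbb{Q}_{Y_t|Z^{t-1}}$, $\forall t$. 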
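Proof. We start by observing that 

\[ \max_{\theta \in \Lambda_\varepsilon} \Pi_{\theta,\gamma}\big\{ \hat\theta_T \ne \theta \big\} \ge \inf_{\hat W} \mathbb{P}\big\{ \hat W \ne W \big\}, \tag{17} \]

where the infimum is over all estimators $\hat W : \mathsf{Y}^T \times \mathsf{U}^T \to [N]$. 
Since any such $\hat W$ is $\sigma(Z^T)$-measurable and since $W$ is 
uniformly distributed on $[N]$, we can apply Fano's inequality 
[3], [20] to write 

\[ \inf_{\hat W} \mathbb{P}\big\{ \hat W \ne W \big\} \ge 1 - \frac{I(W; Z^T) + \log 2}{\log N}, \tag{18} \]

where $I(W; Z^T)$ is the mutual information between $W$ and 
$Z^T = (Y^T, U^T)$ under $\mathbb{P}$. We now expand this mutual 
information: 

\[ I(W; Z^T) = \sum_{t=1}^{T} I(W; Z_t \mid Z^{t-1}) \tag{19} \]
\[ = \sum_{t=1}^{T} I(W; Y_t, U_t \mid Z^{t-1}) \tag{20} \]
\[ = \sum_{t=1}^{T} \big[ I(W; Y_t \mid Z^{t-1}) + I(W; U_t \mid Y_t, Z^{t-1}) \big] \tag{21} \]
\[ = \sum_{t=1}^{T} I(W; Y_t \mid Z^{t-1}), \tag{22} \]

where the first three steps follow from the repeated application 
of the chain rule, while the last step uses the fact that 
$W \to (Y_t, Z^{t-1}) \to U_t$ is a Markov chain. Now, for each 
summand in (22) we have 

\[ I(W; Y_t \mid Z^{t-1}) = D\big( \mathbb{P}_{Y_t|Z^{t-1},W} \,\big\|\, \mathbb{P}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1},W} \big) \tag{23} \]
\[ = \mathbb{E}\left\{ \log \frac{d\mathbb{P}_{Y_t|Z^{t-1},W}}{d\mathbb{P}_{Y_t|Z^{t-1}}} \right\} \tag{24} \]
\[ = \mathbb{E}\left\{ \log \frac{d\mathbb{P}_{Y_t|Z^{t-1},W}}{d\mathbb{Q}_{Y_t|Z^{t-1}}} \right\} - \mathbb{E}\left\{ \log \frac{d\mathbb{P}_{Y_t|Z^{t-1}}}{d\mathbb{Q}_{Y_t|Z^{t-1}}} \right\} \tag{25} \]
\[ = D\big( \mathbb{P}_{Y_t|Z^{t-1},W} \,\big\|\, \mathbb{Q}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1},W} \big) - D\big( \mathbb{P}_{Y_t|Z^{t-1}} \,\big\|\, \mathbb{Q}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1}} \big) \tag{26} \]
\[ \le D\big( \mathbb{P}_{Y_t|Z^{t-1},W} \,\big\|\, \mathbb{Q}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1},W} \big), \tag{27} \]

where the first two steps use the definition of conditional 
mutual information, the next step follows from the fact that 
$\mathbb{P}_{Y_t|Z^{t-1}} \ll \mathbb{Q}_{Y_t|Z^{t-1}}$ for every $t$, the step after that uses 
the definition of conditional divergence, and the last step 
follows because the divergence is nonnegative. Combining 
(17), (18), (22), and (27), and rearranging, we obtain the desired bound (16). □ 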
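The role of Fano's inequality in the proof can be illustrated numerically. The sketch below evaluates the bound $\mathbb{P}\{\hat W \ne W\} \ge 1 - (I(W; Z^T) + \log 2)/\log N$ for a uniformly distributed $W$ on $N$ hypotheses, and inverts it to show how much information must be accumulated before a given error level becomes possible. The function names are hypothetical and all quantities are in nats.

```python
import math

def fano_error_lower_bound(mutual_info_nats, num_hypotheses):
    """Fano bound: P(W_hat != W) >= 1 - (I + log 2) / log N for uniform W."""
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (mutual_info_nats + math.log(2)) / math.log(num_hypotheses)
    # A negative value carries no information, so clip at zero.
    return max(0.0, bound)

def info_needed(num_hypotheses, target_error):
    """Smallest I(W; Z^T), in nats, compatible with error <= target_error."""
    needed = (1.0 - target_error) * math.log(num_hypotheses) - math.log(2)
    return max(0.0, needed)
```

For instance, with $N = 16$ hypotheses and $\log 2$ nats of accumulated information, the error probability of any estimator is at least $1/2$; the bound only becomes vacuous once the information is comparable to $\log N$.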

Note that the left-hand side of (16) involves the initial 
amount of uncertainty about the system (the metric entropy) 
and the best identification error performance at time $T$, while 
the right-hand side is a sum of information divergences added 
up from $t = 1$ to $t = T$. The main power of the Meta-Theorem 
resides in the freedom to choose the auxiliary 
stochastic kernels $\{\mathbb{Q}_{Y_t|Z^{t-1}}\}_{t=1}^{T}$. For example, we may 
consider the case in which $\gamma$ is designed for some "nominal" 
system $\theta_0 \in \Theta$, and we can take $\mathbb{Q}_{Y_t|Z^{t-1}}$ to be the transition 
law of $\theta_0$ controlled by $\gamma$. With this choice, the $t$th term on 
the right-hand side of (16) quantifies the "robustness radius" 
of $\gamma$ on $\Lambda$ at time $t$. Alternatively, we may consider the 
setting in which there is an optimal controller $\gamma_\theta$ associated 
with each $\theta \in \Theta$, and 

\[ \Pi_{\theta,\gamma_\theta}(dY_t \mid Z^{t-1}) = \Pi_{\theta',\gamma_{\theta'}}(dY_t \mid Z^{t-1}) \tag{28} \]

for all $\theta, \theta' \in \Lambda$. In that case, we may take $\mathbb{Q}_{Y_t|Z^{t-1}}$ to be 
the controlled transition law of $\theta$ interconnected with $\gamma_\theta$ (for 
any $\theta$). With this choice, the $t$th term on the right-hand side 
of (16) tells us by how much the actual performance of $\gamma$ 
operating in the presence of uncertainty differs from that of 
the optimal controller at time $t$ when there is no uncertainty. 
In general, the use of an auxiliary sequence of $\mathbb{Q}$-kernels is 
similar to the use of auxiliary channels in the information-theoretic 
"meta-converse" of Polyanskiy et al. [18], [19]. 

The remainder of the paper is devoted to several sample 
applications of the Meta-Theorem, intended to showcase its 
power and flexibility. 

V. Fundamental limits of identification 

Our first application of the Meta-Theorem concerns the 
fundamental limitations of system identification algorithms. 
For the results of this section, the precise structure of the 
controller $\gamma$ is irrelevant, and the influence of $\gamma$ manifests 
itself indirectly through time-dependent bounds on the metric 
identification error. For notational simplicity, we will denote 
by $P_{\theta,t}$ the stochastic kernel $P_\theta(dy_t \mid y^{t-1}, u^{t-1})$, where it is 
understood that $P_{\theta,t}$ is a Borel probability measure on $\mathsf{Y}$ and 
a Borel-measurable function of $(y^{t-1}, u^{t-1})$. 

The nature of the results presented below, and the techniques 
used to prove them, are inspired by the work of 
Yang [6] on the limits of regression learning procedures 
in statistics. Moreover, the statistical estimation setting is 
subsumed by our results since a stochastic process with 
sample paths in $\mathsf{Y}^{\infty}$ and with parameter $\theta \in \Theta$ can be viewed 
as a dynamical system $\{P_\theta(dy_t \mid y^{t-1})\}_{t=1}^{\infty}$ (i.e., the controller 
does not affect the system). 

A. The Critical Separation bound 

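The first result we prove is a lower bound on the $T$-step 
minimax identification error, which is expressed in 
terms of upper bounds for a sequence of $t$-step identification 
algorithms, from $t = 0$ (i.e., any data-free guess about the 
system parameter $\theta$) to $t = T - 1$: 

Theorem 2. Consider a model class $\Lambda$ and a controller $\gamma$. 
Suppose that there exists a sequence $\{\hat\theta_t\}_{t=0}^{\infty}$ of identification 
algorithms, such that 

\[ \sup_{\theta \in \Lambda} \mathbb{E}_{\theta,\gamma} D\big( P_{\theta,t} \,\big\|\, P_{\hat\theta_{t-1},t} \big) \le \delta_t, \qquad \forall t. \tag{29} \]

Then 

\[ e_T(\Lambda, \gamma) \ge \frac{\underline{\varepsilon}_T}{4}, \tag{30} \]

where the critical separation $\underline{\varepsilon}_T$ is chosen so that 

\[ H_\rho(\underline{\varepsilon}_T; \Lambda) = 2 \left( \sum_{t=1}^{T} \delta_t + \log 2 \right). \tag{31} \]

Proof. Consider the setting of Theorem 1 with the given 
$\Lambda$, $\gamma$ and $\varepsilon = \underline{\varepsilon}_T$ defined according to (31). For each $t$, let 
$\mathbb{Q}_{Y_t|Z^{t-1}}$ be defined via 

\[ \mathbb{Q}_{Y_t|Z^{t-1}}(B \mid z^{t-1}) = P_{\hat\theta_{t-1}(z^{t-1})}(B \mid z^{t-1}) \tag{32} \]

for any Borel set $B \subseteq \mathsf{Y}$. Then 

\[ D\big( \mathbb{P}_{Y_t|Z^{t-1},W} \,\big\|\, \mathbb{Q}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1},W} \big) = \sum_{i=1}^{N} \mathbb{P}(W = i) \int \mathbb{P}(dz^{t-1} \mid W = i)\, D\big( P_{\theta_i,t} \,\big\|\, P_{\hat\theta_{t-1},t} \big) \tag{33} \]
\[ = \frac{1}{N} \sum_{i=1}^{N} \int \Pi_{\theta_i,\gamma}(dz^{t-1})\, D\big( P_{\theta_i,t} \,\big\|\, P_{\hat\theta_{t-1},t} \big) \tag{34} \]
\[ \le \sup_{\theta \in \Lambda} \int \Pi_{\theta,\gamma}(dz^{t-1})\, D\big( P_{\theta,t} \,\big\|\, P_{\hat\theta_{t-1},t} \big) \tag{35} \]
\[ \le \delta_t. \tag{36} \]

Then, for any $\hat\theta_T$ taking values in $\Lambda_{\underline{\varepsilon}_T}$, where $\underline{\varepsilon}_T$ is chosen according to (31), 

\[ H_\rho(\underline{\varepsilon}_T; \Lambda) \min_{\theta \in \Lambda_{\underline{\varepsilon}_T}} \Pi_{\theta,\gamma}\big\{ \hat\theta_T = \theta \big\} \le \sum_{t=1}^{T} \delta_t + \log 2. \tag{37} \]

Combining this with (31) and noting that $\hat\theta_T$ was arbitrary, 
we get 

\[ \inf_{\hat\theta_T} \max_{\theta \in \Lambda_{\underline{\varepsilon}_T}} \Pi_{\theta,\gamma}\big\{ \hat\theta_T \ne \theta \big\} \ge \frac{1}{2}. \tag{38} \]

Finally, substituting this into the lower bound (5), we get (30). □ 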
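As a rough illustration of how the critical separation behaves, the sketch below solves $H_\rho(\varepsilon; \Lambda) = 2(\sum_t \delta_t + \log 2)$ for $\varepsilon$ under a purely hypothetical entropy model $H_\rho(\varepsilon; \Lambda) = d \log(1/\varepsilon)$, which is typical of bounded subsets of a $d$-dimensional space. The model, the parameter `dim`, and the function names are assumptions introduced for this example; the paper itself does not fix a particular uncertainty set here.

```python
import math

def critical_separation(deltas, dim):
    """Solve H(eps) = 2 * (sum(deltas) + log 2) for eps, ASSUMING the
    hypothetical entropy model H(eps) = dim * log(1 / eps)."""
    target = 2.0 * (sum(deltas) + math.log(2))
    return math.exp(-target / dim)

def minimax_error_lower_bound(deltas, dim):
    """Lower bound e_T >= eps_T / 4 under the same assumed entropy model."""
    return critical_separation(deltas, dim) / 4.0

# Example: divergence budgets delta_t = 1/t, so information accrues slowly.
budgets = [1.0 / t for t in range(1, 101)]
bound_after_100_steps = minimax_error_lower_bound(budgets, dim=5)
```

Under this model the bound decays as more divergence is accumulated and decays more slowly when `dim` is larger, which matches the qualitative message that richer uncertainty sets are harder to identify.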

B. Easy identification implies small a priori uncertainty 

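We now use Theorem 2 to prove that any class of systems 
that are easy to identify (in the sense that there exists 
a sequence of identification algorithms whose worst-case 
errors over the class decay at some prescribed rate) must 
necessarily have correspondingly small metric entropy. In 
other words, if a class of systems is easy to identify, then its 
a priori uncertainty could not have been very large. 

To formalize things, consider a controller $\gamma$, a sequence 
of identification schemes $\{\hat\theta_t\}_{t=0}^{\infty}$, and a nonincreasing 
sequence of positive reals $\{\beta_t\}_{t=0}^{\infty}$. For a given $k \ge 1$, let 
us define the set $\Lambda_k(\gamma, \{\hat\theta_t\}_{t=0}^{\infty}, \{\beta_t\}_{t=0}^{\infty})$ to consist of all 
systems $\theta \in \Theta$, such that 

\[ \mathbb{E}_{\theta,\gamma}\, \rho^k(\hat\theta_t, \theta) \le \beta_t, \qquad \forall t. \tag{39} \]

Theorem 3. Suppose that $\gamma$ is such that, for all $t$ and all 
$\theta, \theta' \in \Theta$, 

\[ \mathbb{E}_{\theta,\gamma} D\big( P_{\theta,t} \,\big\|\, P_{\theta',t} \big) \le K \rho^k(\theta, \theta') \tag{40} \]

for some $K > 0$. Then the class $\Lambda = \Lambda_k(\gamma, \{\hat\theta_t\}, \{\beta_t\})$ 
satisfies the bound 

\[ H_\rho\big( 5 \beta_T^{1/k}; \Lambda \big) \le 2 \left( K \sum_{t=1}^{T} \beta_{t-1} + \log 2 \right) \tag{41} \]

for every $T$. 

Proof. From the smoothness condition (40) it follows that 

\[ \mathbb{E}_{\theta,\gamma} D\big( P_{\theta,t} \,\big\|\, P_{\hat\theta_{t-1},t} \big) \le K \beta_{t-1} \tag{42} \]

for every $t \ge 1$. Hence, applying Theorem 2 with $\delta_t = K \beta_{t-1}$ we get 

\[ e_T(\Lambda, \gamma) \ge \frac{\underline{\varepsilon}_T}{4}, \tag{43} \]

where 

\[ H_\rho(\underline{\varepsilon}_T; \Lambda) = 2 \left( K \sum_{t=1}^{T} \beta_{t-1} + \log 2 \right). \tag{44} \]

Let $H_T$ denote the quantity on the right-hand side of (44). 
Let us suppose that $H_\rho(5 \beta_T^{1/k}; \Lambda) > H_T$. Then, because 
the mapping $\varepsilon \mapsto H_\rho(\varepsilon; \Lambda)$ is monotone decreasing, we must 
have $5 \beta_T^{1/k} < \underline{\varepsilon}_T$. But that implies that 

\[ e_T(\Lambda, \gamma) \ge \frac{\underline{\varepsilon}_T}{4} > \frac{5 \beta_T^{1/k}}{4} > \beta_T^{1/k}. \tag{45} \]

On the other hand, for any $\theta \in \Lambda$ we have 

\[ \mathbb{E}_{\theta,\gamma}\, \rho(\hat\theta_t, \theta) \le \big( \mathbb{E}_{\theta,\gamma}\, \rho^k(\hat\theta_t, \theta) \big)^{1/k} \le \beta_t^{1/k}, \tag{46} \]

where the first step uses Jensen's inequality and the second 
step uses the definition of $\Lambda$. This implies, in turn, that 

\[ e_T(\Lambda, \gamma) \le \sup_{\theta \in \Lambda} \mathbb{E}_{\theta,\gamma}\, \rho(\hat\theta_T, \theta) \le \beta_T^{1/k}, \tag{47} \]

which contradicts (45). Hence, $H_\rho(5 \beta_T^{1/k}; \Lambda) \le H_T$. □ 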

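As an example of when the smoothness condition (40) 
holds, consider a first-order nonlinear system of the form 

\[ Y_t = f_\theta(Y_{t-1}) + U_{t-1} + V_t, \tag{48} \]

where $\mathsf{Y} = \mathsf{U} = \mathbb{R}$ and $\{V_t\}$ is an i.i.d. sequence of Gaussian 
random variables with zero mean and variance $\sigma^2$. Suppose 
that the mappings $f_\theta$ satisfy the condition 

\[ |f_\theta(y) - f_{\theta'}(y)|^2 \le K_0 F(y) \rho^k(\theta, \theta'), \qquad \forall \theta, \theta' \in \Theta \tag{49} \]

for some $K_0 > 0$, $k \ge 1$, and some function $F : \mathbb{R} \to \mathbb{R}$ 
which is bounded on compacts. Then, provided $\gamma$ is chosen 
so that there exists some finite $R > 0$, such that $|Y_t| \le R$ 
$\Pi_{\theta,\gamma}$-almost surely for every $\theta \in \Theta$, we will have, for any 
$\theta, \theta' \in \Theta$, 

\[ \mathbb{E}_{\theta,\gamma} D\big( P_{\theta,t} \,\big\|\, P_{\theta',t} \big) = \frac{1}{2\sigma^2} \mathbb{E}_{\theta,\gamma} \big| f_\theta(Y_{t-1}) - f_{\theta'}(Y_{t-1}) \big|^2 \tag{50} \]
\[ \le \frac{K_0}{2\sigma^2} \max_{|y| \le R} F(y) \cdot \rho^k(\theta, \theta'). \tag{51} \]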

To appreciate the implications of the above result, we can 
consider the following cases: 

1) $\beta_t \le C t^{-\alpha}$ for some $C > 0$ and $0 < \alpha < 1$. Then, for 
all sufficiently small $\varepsilon$, we will have 

\[ H_\rho(\varepsilon; \Lambda) \le C' \left( \frac{1}{\varepsilon} \right)^{k(1-\alpha)/\alpha}, \tag{52} \]

where $C' > 0$ is a constant that depends only on 
$K, k, \alpha, C$. In this case, the metric complexity of $\Lambda$ 
is, essentially, that of a ball in an infinite-dimensional 
Hilbert space. 

2) $\beta_t \le C t^{-1}$ for some $C > 0$. Then, for all sufficiently 
small $\varepsilon$, we will have 

\[ H_\rho(\varepsilon; \Lambda) \le C' k \log \frac{1}{\varepsilon}, \tag{53} \]

where $C' > 0$ is a constant that depends only on 
$K, k, C$. In this case, $\Lambda$ is, essentially, a ball in a 
finite-dimensional Hilbert space. 
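For completeness, here is one way the two rates can be traced back to the bound $H_\rho(5\beta_T^{1/k}; \Lambda) \le 2(K \sum_{t=1}^{T} \beta_{t-1} + \log 2)$ of Theorem 3. This is a sketch under stated assumptions rather than the authors' own argument: the sum is compared with an integral, $\beta_0$ is treated as a fixed constant, and $T$ is chosen as a function of $\varepsilon$ so that $5\beta_T^{1/k} \le \varepsilon$.

```latex
% Case 1: \beta_t \le C t^{-\alpha} with 0 < \alpha < 1.
\begin{align*}
\sum_{t=1}^{T} \beta_{t-1}
  &\le \beta_0 + C \sum_{s=1}^{T-1} s^{-\alpha}
   \le \beta_0 + C \int_0^{T} s^{-\alpha}\, ds
   = \beta_0 + \frac{C\, T^{1-\alpha}}{1-\alpha}.
\intertext{Let $T = T(\varepsilon)$ be the smallest integer with
$5 (C T^{-\alpha})^{1/k} \le \varepsilon$, so that
$T \le (5^k C)^{1/\alpha} \varepsilon^{-k/\alpha} + 1$.
Since $\varepsilon \mapsto H_\rho(\varepsilon; \Lambda)$ is nonincreasing,}
H_\rho(\varepsilon; \Lambda)
  &\le H_\rho\big(5 \beta_T^{1/k}; \Lambda\big)
   \le 2\Big( K \beta_0 + \frac{K C\, T^{1-\alpha}}{1-\alpha} + \log 2 \Big)
   = O\big( \varepsilon^{-k(1-\alpha)/\alpha} \big).
\end{align*}
% Case 2: \beta_t \le C t^{-1}.
\begin{align*}
\sum_{t=1}^{T} \beta_{t-1} &\le \beta_0 + C (1 + \log T),
\qquad T \le 5^k C\, \varepsilon^{-k} + 1, \\
H_\rho(\varepsilon; \Lambda)
  &\le 2\big( K \beta_0 + K C (1 + \log T) + \log 2 \big)
   = O\Big( k \log \frac{1}{\varepsilon} \Big).
\end{align*}
```

In both cases the faster the identification error decays (larger $\alpha$, or the $t^{-1}$ rate), the smaller the resulting entropy bound, in line with the reading that easy identification forces small a priori uncertainty.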



VI. Rates of convergence in adaptive control 

In this section, we will use the Meta-Theorem to derive a 
fundamental limit on the minimum time needed to achieve 
a particular control objective. 

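Consider the problem of adaptively controlling a first-order 
$n$-dimensional linear system 

\[ Y_{t+1} = A Y_t + U_t + V_{t+1}, \qquad t = 1, 2, \ldots \tag{54} \]

where $\mathsf{U} = \mathsf{Y} = \mathbb{R}^n$, $\{U_t\}_{t=1}^{\infty}$ is the input (control) sequence, 
$\{Y_t\}_{t=1}^{\infty}$ is the output sequence, and $\{V_t\}_{t=2}^{\infty}$ is an i.i.d. 
Gaussian disturbance process with zero mean and covariance 
matrix $\sigma^2 I_{n \times n}$, independent of the initial state $Y_1$. We 
assume that the initial state $Y_1$ has a finite second moment, 
$\mathbb{E}\|Y_1\|^2 = C < \infty$. The unknown system matrix $A \in \mathbb{R}^{n \times n}$ 
is assumed to lie in the set 

\[ \Lambda = \big\{ A \in \mathbb{R}^{n \times n} : \|A\| \le 1 \big\}, \tag{55} \]

where $\|\cdot\|$ denotes the operator (spectral) norm. The space of 
admissible controllers $\Gamma$ is assumed to consist of sequences 
$\gamma = \{\gamma_t\}_{t=1}^{\infty}$ of deterministic Borel mappings $\gamma_t : \mathsf{Y}^t \times \mathsf{U}^{t-1} \to \mathsf{U}$, 
so that $U_t = \gamma_t(Y^t, U^{t-1})$. The objective is to 
select a control law $\gamma^* \in \Gamma$ such that 

\[ \limsup_{T \to \infty} \mathbb{E}_{A,\gamma^*}\left\{ \frac{1}{T} \sum_{t=1}^{T} \|Y_{t+1}\|^2 \right\} = \inf_{\gamma \in \Gamma} \limsup_{T \to \infty} \mathbb{E}_{A,\gamma}\left\{ \frac{1}{T} \sum_{t=1}^{T} \|Y_{t+1}\|^2 \right\} \tag{56} \]

for every $A \in \Lambda$. 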

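Following Lai [21], we can define the $T$-step regret of $\gamma$ 
on $A$ by 

\[ R_T(\gamma, A) \triangleq \mathbb{E}_{A,\gamma}\left\{ \sum_{t=1}^{T} \|Y_{t+1} - V_{t+1}\|^2 \right\}. \tag{57} \]

Since $Y_{t+1} - V_{t+1}$ is independent of $V_{t+1}$, we can write 

\[ \mathbb{E}\|Y_{t+1}\|^2 = \mathbb{E}\|Y_{t+1} - V_{t+1}\|^2 + n\sigma^2 \tag{58} \]
\[ = \mathbb{E}\|A Y_t + U_t\|^2 + n\sigma^2 \tag{59} \]
\[ \ge n\sigma^2. \tag{60} \]

This implies that the infimum on the right-hand side of 
(56) is equal to $n\sigma^2$; consequently, we seek a $\gamma^*$ such that, 
for all $A \in \Lambda$, 

\[ \limsup_{T \to \infty} \frac{R_T(\gamma^*, A)}{T} = \inf_{\gamma \in \Gamma} \limsup_{T \to \infty} \frac{R_T(\gamma, A)}{T} = 0. \tag{61} \]

Lai [21] calls any such $\gamma^*$ asymptotically efficient. 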
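The regret measures the energy of the predictable part $A Y_t + U_t$ of the next output, i.e., what is left after the unavoidable noise $V_{t+1}$ is removed. A minimal scalar ($n = 1$) simulation makes this visible. The sketch below is an assumption-laden illustration: it uses memoryless controllers of the form $U_t = g(Y_t)$, which is a special case of the admissible class, and it returns a single sample-path value of the sum rather than its expectation. All names are hypothetical.

```python
import random

def simulate_regret(a, controller, horizon, sigma=1.0, y1=1.0, seed=0):
    """Simulate Y_{t+1} = a*Y_t + U_t + V_{t+1} in the scalar case and return
    one sample of sum_t (Y_{t+1} - V_{t+1})^2 along the trajectory."""
    rng = random.Random(seed)
    y, total = y1, 0.0
    for _ in range(horizon):
        u = controller(y)
        predictable = a * y + u          # this is Y_{t+1} - V_{t+1}
        total += predictable ** 2
        y = predictable + rng.gauss(0.0, sigma)
    return total

a_true = 0.8

def oracle(y):
    """Controller that knows a_true and cancels the drift exactly."""
    return -a_true * y

def passive(y):
    """Controller that applies no input at all."""
    return 0.0
```

The oracle controller incurs zero regret, whereas any controller that does not know `a_true` must pay for learning it; the rest of this section quantifies how long that takes.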
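Given a controller $\gamma \in \Gamma$, let us define the quantity 

\[ T^*(\varepsilon) = \sup_{A \in \Lambda} \inf\left\{ T \ge 1 : \frac{R_T(\gamma, A)}{T} < \varepsilon \right\}. \tag{62} \]

This is the minimum time it takes $\gamma$ to achieve average regret 
of less than $\varepsilon$ on every $A \in \Lambda$. We will obtain a lower bound 
on $T^*(\varepsilon)$ for any $\gamma$ that has a certain property known as 
persistent excitation (cf. [13], [17], [21], [22]): 

Definition 3. Given $c > 0$ and $\delta \in (0, 1)$, a controller $\gamma \in \Gamma$ 
has the $(c, \delta)$-persistent excitation property if there exists 
some $T_0 \in \mathbb{N}$ such that, for every $A \in \Lambda$, 

\[ \Pi_{A,\gamma}\left( \frac{1}{T} \sum_{t=1}^{T} Y_t Y_t^{\top} \succeq c I_{n \times n} \right) \ge 1 - \delta, \qquad \forall T \ge T_0, \tag{63} \]

where for any two $M_1, M_2 \in \mathbb{R}^{n \times n}$ the notation $M_1 \succeq M_2$ 
means that $M_1 - M_2$ is a positive semidefinite matrix. 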

Our main result is as follows: 

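Theorem 4. Any controller $\gamma \in \Gamma$ that has the $(c, \delta)$-persistent 
excitation property with $\delta < 1/4$ must satisfy 

\[ T^*(\varepsilon) = \Omega\left( \frac{n^2 \sigma^2}{\varepsilon} \log \frac{1}{\varepsilon} \right), \tag{64} \]

where the constant implicit in the $\Omega(\cdot)$ notation depends only 
on $c$ and $\delta$. 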

Proof. We first show that any good controller can be used 
to construct a good identification scheme. The proof of this 
assertion essentially follows Nemirovski and Tsypkin [12]. 

Given a controller $\gamma = \{\gamma_t\}$, we first note that the 
probability that any component of $Y_t$ vanishes is zero. Hence, 
without loss of generality for every $t$ we can write 

\[ \gamma_t(Y^t, U^{t-1}) = -F_t(Y^t, U^{t-1}) Y_t, \qquad \text{a.s.} \tag{65} \]

for some measurable mapping $F_t : \mathsf{Y}^t \times \mathsf{U}^{t-1} \to \mathbb{R}^{n \times n}$. Now 
for each $T$ let 

\[ G_T = \sum_{t=1}^{T} Y_t Y_t^{\top} \tag{66} \]

and consider the following least-squares identification algorithm: 

\[ \hat A_T = \begin{cases} 0, & \text{if } \det G_T = 0 \\ \sum_{t=1}^{T} F_t(Y^t, U^{t-1}) Y_t Y_t^{\top} G_T^{-1}, & \text{otherwise.} \end{cases} \tag{67} \]
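The estimator reads the system matrix off the controller's own gains: if the controller is doing well, then $F_t Y_t \approx A Y_t$, so a $Y_t Y_t^{\top}$-weighted average of the gains $F_t$ should be close to $A$. The sketch below implements the scalar ($n = 1$) version, where the weights are $Y_t^2$ and the normalizer is taken to be $G_T = \sum_t Y_t^2$; this form of $G_T$ is an assumption inferred from the surrounding argument, and the function name is hypothetical.

```python
def least_squares_estimate(gains, outputs):
    """Scalar (n = 1) version of the gain-averaging estimator.

    gains   -- F_1, ..., F_T, the controller gains with U_t = -F_t * Y_t
    outputs -- Y_1, ..., Y_T
    Returns 0 if G_T = sum_t Y_t^2 vanishes, and
    sum_t F_t * Y_t^2 / G_T otherwise.
    """
    g = sum(y * y for y in outputs)
    if g == 0.0:
        return 0.0
    return sum(f * y * y for f, y in zip(gains, outputs)) / g
```

When the gains are constant the estimate returns that constant exactly, and outputs with larger magnitude carry more weight, which is why a persistent-excitation condition on the outputs is needed to control the estimation error.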

For this identification algorithm, we have the following 
lemma, whose proof is presented in Appendix I. 

Lemma 1. Suppose $\gamma$ has the $(c, \delta)$-persistent excitation 
property. Then for every $A \in \Lambda$ and for every $T \ge T_0$, 

\[ \|\hat A_T - A\|^2 \le \frac{1}{cT} \sum_{t=1}^{T} \|Y_{t+1} - V_{t+1}\|^2 \tag{68} \]

with $\Pi_{A,\gamma}$-probability at least $1 - \delta$. 



Next we show that if $\gamma$ achieves average regret of less 
than $\varepsilon$ in $T$ time steps, then the corresponding identification 
scheme $\hat A_T$ must have a small probability of error. 

Given $\varepsilon$, let $N_{\|\cdot\|}(\varepsilon; \Lambda)$ denote the $\varepsilon$-packing number of $\Lambda$ 
w.r.t. the metric induced by the spectral norm. Since $\Lambda$ is a 
norm ball in $\mathbb{R}^{n \times n}$, there exist constants $b_n, c_n > 0$, such that 

\[ b_n + n^2 \log \frac{1}{\varepsilon} \le \log N_{\|\cdot\|}(\varepsilon; \Lambda) \le c_n + n^2 \log \frac{1}{\varepsilon} \tag{69} \]

for all sufficiently small $\varepsilon > 0$. Now let $N(\varepsilon) = N_{\|\cdot\|}(4\sqrt{\varepsilon/c}; \Lambda)$ 
and take $\{A_1, \ldots, A_{N(\varepsilon)}\} \subset \Lambda$ to be a 
maximal $4\sqrt{\varepsilon/c}$-packing set. Given a controller $\gamma$, define 

\[ \hat W = \arg\min_{1 \le i \le N(\varepsilon)} \|\hat A_T - A_i\|. \tag{70} \]

Then we have the following lemma, whose proof is given in 
Appendix II. 

Lemma 2. Suppose that $\gamma$ has the $(c, \delta)$-persistent excitation 
property and achieves average regret at most $\varepsilon$ in time $T$. Let $W$ 
be a random variable uniformly distributed over the set 
$\{1, \ldots, N(\varepsilon)\}$ independently of $Y_1$, $\{V_t\}$. Then the estimator 
(70) satisfies 

\[ \mathbb{P}(\hat W \ne W) \le \frac{1}{4} + \delta < \frac{1}{2}. \tag{71} \]



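To finish the proof, we now apply the Meta-Theorem. 
For each $t$, let $\mathbb{Q}_{Y_t|Z^{t-1}} = \mathbb{Q}_{V_t}$ be the normal distribution 
$N(0, \sigma^2 I_{n \times n})$. Then 

\[ D\big( \mathbb{P}_{Y_t|Z^{t-1},W} \,\big\|\, \mathbb{Q}_{Y_t|Z^{t-1}} \,\big|\, \mathbb{P}_{Z^{t-1},W} \big) = \frac{1}{2\sigma^2} \mathbb{E}\|A_W Y_{t-1} + U_{t-1}\|^2 \tag{72} \]
\[ = \frac{1}{2\sigma^2} \mathbb{E}\|Y_t - V_t\|^2. \tag{73} \]

Then, combining Fano's inequality (18) with the error bound (71) and summing the divergences over the $T$ controlled steps, 

\[ \left( \frac{3}{4} - \delta \right) \log N(\varepsilon) \le \frac{1}{2\sigma^2} \sum_{t=1}^{T} \mathbb{E}\|Y_{t+1} - V_{t+1}\|^2 + \log 2 \tag{74} \]
\[ \le \frac{1}{2\sigma^2} \sup_{A \in \Lambda} R_T(\gamma, A) + \log 2 \tag{75} \]
\[ \le \frac{T\varepsilon}{2\sigma^2} + \log 2. \tag{76} \]

Using the lower bound on $\log N(\varepsilon)$ from (69) and rearranging, we obtain (64), and the theorem is proved. □ 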

VII. Conclusion 

We have presented a Meta-Theorem on the inevitable 
trade-offs between a priori uncertainty, a posteriori uncer- 
tainty, and the information accumulated online in the process 
of controlling an unknown stochastic dynamical system. 
The Meta-Theorem connects the notions of information, 
learning, and adaptation in the sense of Kolmogorov and 
Zames with the Shannon-theoretic notion of information gain 
quantified by the divergence between the actual sequence of 
the system kernels and some sequence of auxiliary stochastic 
kernels. The freedom of choosing these auxiliary kernels is 
what gives the Meta-Theorem its power. We have used the 
Meta-Theorem to derive fundamental lower bounds on the 
performance of system identification algorithms and on the 
minimum time needed to stabilize an uncertain linear system. 
As part of future work, we will investigate fundamental limits 
of robust estimation and control algorithms over uncertainty 
sets defined directly by divergence (relative entropy) con- 
straints [23], [24]. 
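Appendix I 
Proof of Lemma 1 

For brevity, we will write $F_t$ instead of $F_t(Y^t, U^{t-1})$. 
Suppose that the event in (63) holds for a given $A \in \Lambda$. 
Then $G_T$ is invertible, and 

\[ A - \hat A_T = \sum_{t=1}^{T} (A - F_t) Y_t Y_t^{\top} G_T^{-1}. \tag{I.1} \]

Let $\Delta_t = A - F_t$ and $H_t = Y_t Y_t^{\top}$. Then for any two vectors 
$u, v \in \mathbb{R}^n$ we have 

\[ \big| u^{\top} (A - \hat A_T) v \big| = \Big| \sum_{t=1}^{T} u^{\top} \Delta_t H_t G_T^{-1} v \Big| \tag{I.2} \]
\[ = \Big| \sum_{t=1}^{T} \big( u^{\top} \Delta_t Y_t \big) \big( Y_t^{\top} G_T^{-1} v \big) \Big| \tag{I.3} \]
\[ \le \Big( \sum_{t=1}^{T} \big( u^{\top} \Delta_t Y_t \big)^2 \Big)^{1/2} \Big( \sum_{t=1}^{T} \big( Y_t^{\top} G_T^{-1} v \big)^2 \Big)^{1/2} \tag{I.4} \]
\[ = \Big( \sum_{t=1}^{T} u^{\top} \Delta_t Y_t Y_t^{\top} \Delta_t^{\top} u \Big)^{1/2} \big( v^{\top} G_T^{-1} v \big)^{1/2} \tag{I.5} \]
\[ \le \frac{1}{\sqrt{cT}} \Big( \sum_{t=1}^{T} u^{\top} \Delta_t Y_t Y_t^{\top} \Delta_t^{\top} u \Big)^{1/2} \|v\|, \tag{I.6} \]

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^n$, the first step uses (I.1), 
the second step uses the definition of $H_t$, the third step uses 
Cauchy-Schwarz, the fourth step uses $\sum_{t=1}^{T} H_t = G_T$, and the last step uses 
the persistent excitation property, which gives $G_T^{-1} \preceq (cT)^{-1} I_{n \times n}$. 
Taking the supremum of the square of both sides of 
(I.6) over all $v$ with $\|v\| = 1$ and using the fact that 

\[ \Delta_t Y_t = (A - F_t) Y_t = A Y_t + U_t = Y_{t+1} - V_{t+1}, \tag{I.7} \]

we obtain the bound 

\[ \big\| (A - \hat A_T)^{\top} u \big\|^2 \le \frac{1}{cT} \sum_{t=1}^{T} \big| (Y_{t+1} - V_{t+1})^{\top} u \big|^2 \tag{I.8} \]

that holds for all $u \in \mathbb{R}^n$. Taking the supremum over all 
unit-norm $u$, we get the lemma. 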



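Appendix II 
Proof of Lemma 2 

For every $i \in [N(\varepsilon)]$ define the following events: 

\[ S_T^{(i)} \triangleq \{W = i\} \cap \big\{ \|\hat A_T - A_i\| \ge 2\sqrt{\varepsilon/c} \big\} \tag{II.9} \]
\[ E_T^{(i)} \triangleq \{W = i\} \cap \Big\{ \frac{1}{T} \sum_{t=1}^{T} Y_t Y_t^{\top} \succeq c I_{n \times n} \Big\} \tag{II.10} \]
\[ R_T^{(i)} \triangleq \{W = i\} \cap \Big\{ \frac{1}{T} \sum_{t=1}^{T} \|Y_{t+1} - V_{t+1}\|^2 \ge 4\varepsilon \Big\}. \tag{II.11} \]

Let $\mathbb{P}_i(\cdot)$ and $\mathbb{E}_i\{\cdot\}$ denote $\mathbb{P}(\cdot \mid W = i)$ and $\mathbb{E}\{\cdot \mid W = i\}$, 
respectively. If $\gamma$ achieves average regret at most $\varepsilon$ in time $T$, then by 
Markov's inequality 

\[ \mathbb{P}_i\big( R_T^{(i)} \big) \le \frac{\mathbb{E}_i\big\{ \frac{1}{T} \sum_{t=1}^{T} \|Y_{t+1} - V_{t+1}\|^2 \big\}}{4\varepsilon} \le \frac{1}{4}. \tag{II.12} \]

Now suppose that $W = i$, but $\hat W \ne i$ and $S_T^{(i)}$ is false. By 
definition of $\hat W$, we must then have 

\[ \|\hat A_T - A_{\hat W}\| \le \|\hat A_T - A_i\|. \tag{II.13} \]

Moreover, since both $A_i$ and $A_{\hat W}$ belong to the $4\sqrt{\varepsilon/c}$-packing 
set and $\hat W \ne i$, $\|A_i - A_{\hat W}\| \ge 4\sqrt{\varepsilon/c}$. Then the triangle 
inequality gives 

\[ \|\hat A_T - A_i\| \ge \|A_i - A_{\hat W}\| - \|\hat A_T - A_{\hat W}\| \ge 4\sqrt{\varepsilon/c} - \|\hat A_T - A_i\|, \tag{II.14} \]

so that $\|\hat A_T - A_i\| \ge 2\sqrt{\varepsilon/c}$. 
This contradicts the assumption that $S_T^{(i)}$ is false. Hence, 
$\{W = i\} \cap \{\hat W \ne i\} \subseteq S_T^{(i)}$. 
By Lemma 1, $S_T^{(i)} \cap E_T^{(i)} \subseteq R_T^{(i)}$. Therefore, 

\[ \mathbb{P}_i(\hat W \ne W) \le \mathbb{P}_i\big( S_T^{(i)} \big) \le \mathbb{P}_i\big( S_T^{(i)} \cap E_T^{(i)} \big) + \mathbb{P}_i\big( \overline{E_T^{(i)}} \big) \le \mathbb{P}_i\big( R_T^{(i)} \big) + \mathbb{P}_i\big( \overline{E_T^{(i)}} \big) \le \frac{1}{4} + \delta, \tag{II.15} \]

where the bar denotes set-theoretic complement. Averaging 
w.r.t. the distribution of $W$, we obtain the statement of the 
lemma. 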